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EXTRINSIC LOCAL REGRESSION ON MANIFOLD-VALUED DATA 


LIZHEN LIN, BRIAN ST. THOMAS, HONGTU ZHU, AND DAVID B. DUNSON 


Abstract. We propose an extrinsic regression framework for modeling data with manifold valued 
responses and Euclidean predictors. Regression with manifold responses has wide applications 
in shape analysis, neuroscience, medical imaging and many other areas. Our approach embeds 
the manifold where the responses lie onto a higher dimensional Euclidean space, obtains a local 
regression estimate in that space, and then projects this estimate back onto the image of the man¬ 
ifold. Outside the regression setting both intrinsic and extrinsic approaches have been proposed 
for modeling i.i.d manifold-valued data. However, to our knowledge our work is the first to take 
an extrinsic approach to the regression problem. The proposed extrinsic regression framework 
is general, computationally efficient and theoretically appealing. Asymptotic distributions and 
convergence rates of the extrinsic regression estimates are derived and a large class of examples 
are considered indicating the wide applicability of our approach. 

Keywords: Convergence rate; Differentiable manifold; Geometry; Local regression; Object 
data; Shape statistics. 


1. Introduction 


Although the main focus in statistics has been on data belonging to Euclidean spaces, it is common 
for data to have support on non-Euclidean geometric spaces. Perhaps the simplest example is to 
directional data, which lie on circles or spheres. Directional statistics dates back to R.A. Fisher’s 


seminal paper ( 

Fisher 

1953 

on analyzing the directions of the earth’s magnetic poles, with key 

later developments by 

Watson 

(1983), 

Mardia and Jupp (2000), Fisher et al. 

(1987) among others. 


Technological advances in science and engineering have led to the routine collection of more complex 
geometric data. For example, diffusion tensor imaging (DTI) obtains local information on the 


directions of neural activity through 3x3 positive definite matrices at each voxel (Alexander et al. 


2007). In machine vision, a digital image can be represented by a set of fc-landmarks, the collection 


of which form landmark based shape spaces (Kendall 1984). In engineering and machine learning, 
images are often preprocessed or reduced to a collection of subspaces, with each data point (an 
image) in the sample data represented by a subspace. One may also encounter data that are stored 


as orthonormal frames (Downs et al. 1971), surfaces, curves, and networks. 


Statistical analysis of data sets whose basic elements are geometric objects requires a precise math¬ 
ematical characterization of the underlying space and inference is dependent on the geometry of 
the space. In many cases (e.g., space of positive definite matrices, spheres, shape spaces, etc), the 
underlying space corresponds to a manifold. Manifolds are general topological spaces equipped 
with a differentiable/smooth structure which induces a geometry that does not in general adhere to 
the usual Euclidean geometry. Therefore, new statistical theory and models have to be developed 
for statistical inference of manifold-valued data. There have been some developments on inferences 
based on i.i.d (independent and identically distributed) observations on a known manifold. Such 
approaches are mainly based on obtaining statistical estimators for appropriate notions of location 
and spread on the manifold. For example, one could base inference on the center of a distribution 


on the Frechet mean, with the asymptotic distribution of sample estimates obtained (|Bhattacharya 


and Patrangenaru 

2003 

2005 

Bhattacharya and Lin 

2013 

). There has also been 

some considera- 

tion of nonparametric density estimation on manifolds 

(Bhattacharya and Dunson 

2010 

Lin et al. 

2013 

Pelletier 

2005 

)• 

Bhattacharya and Bhattacharya ( 

2012 

provides a recent overview of such 


developments. 
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There has also been a growing interest in modeling the relationship between a manifold-valued 
response Y and Euclidean predictors X. For example, many studies are devoted to investigating 
how brain shape changes with age, demographic factors, IQ and other variables. It is essential 
to take into account the underlying geometry of the manifold for proper inference. Approaches 
that ignore the geometry of the data can potentially lead to highly misleading predictions and 


inferences. Some geometric approaches have been developed in the literature. For example, Fletcher 


(2011) develops a geodesic regression model on Riemannian manifolds, which can be viewed as a 
counterpart of linear regression on manifolds, and subsequent work of Hinkle et al. (2012) generalizes 
polynomial regression model to the manifold. These parametric and semi-parametric models are 
elegant, but may lack sufficient flexibility in certain applications. Shi et al. (2009) proposes a semi- 


parametric intrinsic regression model on manifolds, and Davis et al. (2007) generalizes an intrinsic 
kernel regression method on the Riemannian manifold, considering applications in modeling changes 
in brain shape over time. Yuan et al. (2012) develops an intrinsic local polynomial model on the 
space of symmetric positive definite matrices, which has applications in diffusion tensor imaging. A 
drawback of intrinsic models is the heavy computational burden incurred by minimizing a complex 
objective function along geodesics, typically requiring evaluation of an expensive gradient in an 
iterated algorithm. The objective functions often have multiple modes, leading to large sensitivity 
to start points. Further, existence and uniqueness of the population regression function holds only 
under relatively restrictive conditions. Therefore, usual descent algorithms used in estimation are 
not guaranteed to converge to a global optima. 


With the motivation of developing general purpose computationally efficient, theoretically sound 
and practically useful regression modeling frameworks for manifold-valued response data, we pro¬ 
pose a nonparametric extrinsic regression model by first embedding the manifold where the response 
resides onto some higher-dimensional Euclidean spaces. We use equivariant embeddings, which pre¬ 
serve a great deal of geometry for the images. A local regression estimate (such as a local polynomial 
estimate) of the regression function is obtained after embedding, which is then projected back onto 
the image of the manifold. Outside the regression setting, both intrinsic and extrinsic approaches 
have been proposed for modeling of manifold-valued data and for mathematically studying the 
properties of manifolds. However, to our knowledge, our work is the first in taking an extrinsic 
approach in the regression modeling context. Our approach is general, has elegant asymptotic 
theory and outperforms intrinsic models in terms of computation efficiency. In addition, there is 
essentially no difference in inference with the examples considered. 

The article is organized as follows. Section [2] introduces the extrinsic regression model. In Section 
[3j we explore the full utilities of our method through applications to three examples in which 
the response resides on different manifolds. A simulation study is carried out for data on the 
sphere (example |3.1[ ) applying both intrinsic and extrinsic models. The results indicate the overall 
superiority of our extrinsic method in terms of computational complexity and time compared to 
that of intrinsic methods. The extrinsic models are also applied to planar shape manifolds in 
example |3.2[ with an application considered to modeling the brain shape of the Corpus Callosum 
from an ADHD (Attention Deficit/Hyperactivity Disorder) study. In example |3.3| our method is 
applied to data on the Grassmannian considering both simulated and real data. Section[4]is devoted 
to studying the asymptotic properties of our estimators in terms of asymptotic distribution and 
convergence rate. 


2. Extrinsic local regression on manifolds 

Let Y £ M be the response variable in a regression model where (M, p) is a general metric space 
with distance metric p. Let X £ R m be the covariate or predictor variable. Given data (xi,yi) 
(i = 1 ,...,to), the goal is to model a regression relationship between Y and A'. The typical 
regression framework with yi = F(xi) + is not appropriate here as expressions like yi — F(xi) are 
not well-defined due to the fact that the space M (e.g., a manifold) where the response variable 
lies is in general not a vector space. Let P(x,y) be the joint distribution of (A, Y) and P{x) be 
the marginal distribution of X with marginal density fx(x). Denote P(y\x) as the conditional 
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distribution of Y given X with conditional density p(y\x). One can define the population regression 
function or map F(x) (if it exists) as 

F(x) = argmin/ p 2 (q, y)P(dy\x), (2.1) 

q GM J M 

where p is the distance metric on M. 


Let M be a d-dimensional differentiable or smooth manifold. A manifold M is a topological space 
that locally behaves like a Euclidean space. In order to equip M with a metric space structure, 
one can employ a Ricmannian structure, with p taken to be the geodesic distance, which defines 
an intrinsic regression function. Alternatively, one can embed the manifold onto some higher di¬ 
mensional Euclidean space via an embedding map J and use the Euclidean distance || • || instead. 
The latter model is referred to as an extrinsic regression model. One of the potential hurdles for 
carrying out intrinsic analysis is that uniqueness of the population regression function in (2.11 (with 
p taken to be the geodesic distance) can be hard to verify. Le and Barden (2014) establish several 
interesting and deep results for the regression framework and provide broader conditions for verify¬ 
ing the uniqueness of the population regression function. Intrinsic models can be computationally 
expensive, since minimizing their complex objective functions typically require a gradient descent 
type algorithm. In general, this requires fine tuning at each step, which results in an excessive 
computational burden. Further, these gradient descent algorithms are not always guaranteed to 
converge to a global minimum or only converge under very restrictive conditions. In contrast, the 
uniqueness of the population regression holds under very general conditions for extrinsic models. 
Extrinsic models are extremely easy to evaluate and are orders of magnitude faster than intrinsic 
models. 


Let J : M — » E D be an embedding of M onto some higher dimensional ( D > d) Euclidean space 
E d and denote the image of the embedding as M = J(M). By the definition of embedding, the 
differential of J is a map between the tangent space of M at q and the tangent space of E D at 
J(q ); that is, d q J : T q M — > Tj^E d is an injective map and J is a homeomorphism of M onto its 
image M. Here T q M is the tangent space of M at q and Tj^E 0 is the tangent space of E D at 
J(q). Let || • || be the Euclidean norm. In an extrinsic model, the true extrinsic regression function 
is defined as 

F(x) = argmin I \\J(q) - J(y)\\ 2 P(dy\x) 

q£M JM 

= argmin l\\J(q) — z\\ 2 P(dz\x) (2.2) 

q&M Jm 

where P(- | x) = P(- | x) o J -1 is the conditional probability measure on J(M) given x induced by 
the conditional probability measure P(- | x) via the embedding J. 


We now proceed to propose an estimator for F(x). Let K : R m — > K. be a multivariate kernel 
function such that f Rm K(x)dx = 1 and f Rm xK(x)dx = 0. One can take K to be a product of to 
one-dimensional kernel functions for example. Let H = Diag(/ii,..., h m ) with hi > 0 (i = 1,..., to) 
be the bandwidth vector and \H\ = h±... h m . Let Kh{x) = and 


9, -v • K n{xi - x) y - J{yi )\\ 2 

F(x) = argmin ^-r- 

ye£ D 2^=1 F H {Xi - x) 


E 


J(yi)K H (xj - x) 

EILi K n(xi - x) ’ 


(2.3) 


which is basically a weighted average of points J(j/i),..., J(y n )- We are now ready to define the 
extrinsic kernel estimate of the regression function F(x) as 


F e (x) = J- 1 (V(%))) = J- 1 


argmin 11q - F(x)|| , 

V qeM J 


(2.4) 


where P denotes the projection map onto the image M. Basically, our estimation procedure consists 
of two steps. In step one, it calculates a local regression estimate on the Euclidean space after 
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embedding. In step two, the estimate obtained in step one is projected back onto the image of the 
manifold. 

Remark 2.1. The embedding J used in the extrinsic regression model is in general not unique. 
It is desirable to have an embedding that preserves as much geometry as possible. An equivariant 
embedding preserves a substantial amount of geometry. Let G be some large Lie group acting on M. 
We say that J is an equivariant embedding if we can find a group homomorphism cf> : G —> GL(D, R) 
from G to the general linear group GL(D,R) of degree D such that 

= <l>(g)J(q) 

for any g £ G and q £ M. The intuition behind equivariant embedding is that the image of M 
under the group action of the Lie group G is preserved by the group action of on the image, 
thus preserving many geometric features. Note that the choice of embedding is not unique and in 
some cases constructing an equivariant embedding can be a non-trivial task, but in most of the 
cases a natural embedding would arise and such embeddings can often be verified as equivariant. 

Remark 2.2. Alternatively, we can obtain some robust estimator under our proposed framework. 
The regression estimate is taken as the projection of the following estimator onto the image M of 
Ad after an embedding J. We can call it the extrinsic median regression model. Specifically, we 
define 


0/ x • v- h H (xi-x)\\y~ J{yi 

F(x) = argmm > ———— - r 

v&ev 2 Z l =i R H{x i - x) 


and F e (x) = J 1 I argmin||g — F(x)|| J . (2.5) 

V q ei J 


One can use the Weizfield formula ( Weiszfeld| 1937) in calculating the weighted median of ( |2.5| ) (if 
it exists). Such estimates can be shown to be robust to outliers and contaminations. 


Remark 2.3. A kernel estimate is obtained first in (2.31 before projection. However, the framework 


can be easily generalized using higher 

and Gijbels 

1996 

). For example, one 

projection. 1 

?hat is, for any x, let 


00,Pi) = argminV || J(yi) - P 0 - P\(xi - x)\\~ d\ H (xi - x). 

00,01 i— 1 


Then, we have 


F (x) = p 0 (x) 


F e {x) = J- 1 (v{F{x)) 


= J 1 argrnin | |q — F(x) 

V ,EM 


( 2 . 6 ) 

(2.7) 

( 2 . 8 ) 


The properties of the estimator F E (x) where F(x) is given by the general pth local polynomial 


estimator of J(y i),..., J(y n ) are explored in Theorem 4.4 


Note that our work addresses different problems from that of Cheng and Wu (2013), which provides 
an elegant framework for high dimensional data analysis and manifold learning by first performing 
local linear regression on a tangent plane estimate of a lower-dimensional manifold where the high¬ 
dimensional data concentrate. 


3. Examples and applications 

The proposed extrinsic regression framework is very general and has appealing asymptotic properties 
as will be shown in Section 4. To illustrate the wide applicability of our approach and validate its 
finite sample performance, we carry out a study by applying our method to various examples with 
the response taking values in many well-known manifolds. For each of the examples considered, 
we provide details on the embeddings, verify such embeddings are equivariant, and give explicit 
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expressions for the projections to obtain the final estimate in each case. In example |3.1[ we simulate 
data from a 2-dimensional sphere and compare the estimates from our extrinsic regression model 
with that of an intrinsic model. The result indicates that the extrinsic models clearly outperform 
the intrinsic models by orders of magnitude in terms of computational complexity and time. In 


example 3.2 we study a data example with response from a planar shape, in which the brain shape 
of the subjects are represented by landmarks on the boundary. Example |3.3| provides details of 
the estimator when the responses take values on a Stiefel or Grassmann manifold. The method is 
illustrated with a synthetic data set and small financial time series data set, both of which have 
subspace responses of possibly mixed dimension and covariates, which are the corresponding time 
points. 

Example 3.1. Statistical analysis on i.i.d data from the 2-dimensional sphere S' 2 , often called 


directional statistics, has a long history (Fisher 1953 Watson 1983 Mardia and Jupp 2000 Fisher 


et al. 1987). Recently, Wang and Lerman (2015) applied a nonparametric Bayesian approach to 
an example with response on the circle S 1 . In this example, we work out the details in an extrinsic 
regression model with the responses lying on a d-dimensional sphere S d . The model is illustrated 
with data {(xi,yi), i = 1,..., n}, where yi € S 2 . 

Note that S d is a submanifold of K d+1 ; therefore, the inclusion map i serves as a natural embedding 
onto R d+1 . It is easy to check that the embedding is equivariant with the Lie group G = SO(d- 1-1), 
the special orthogonal group of ( d + 1) by (d + 1) matrices A with AA T = 1 and |A| = 1. Take the 
homomorphism map from G to GL(d + 1,M) to be the identity map. Then it is easy to see that 
J(gp) = gp = <j>{g)J(j>), where g £ G and p £ S d . 


Given J(y i),..., J(y„), one first obtains F(x) as given in (2.3). Its projection onto the image M is 
given by 

Fe{x) = F(x)/\\F(x)\\, when F(x) / 0. (3.1) 


There are many well defined parametric distributions on the sphere. A common and useful distribu¬ 
tion is the von Mises-Fisher distribution (Fisher 1953) on the unit sphere, which has the following 
density with respect to the normalized volume measure on the sphere: 

Pmf(v, P, k) oc exp (np T y), 


where k is a concentration parameter with p a location parameter and E(y) = p holds. We simulate 
the data from the unit sphere by letting the mean function be covariate-dependent. That is, let 

(3 ox 

^ |/3 ° x\ ’ 

where j3 o x is the Hadamard product {Pxx 1 ,..., f3 m x m ). 


For this example, we will use data generated by the following model 

/3 ~AT 3 (0,7), x\ ~ N( 0,1), x 2 ~ 1V(0,1), x\ = x\ *x 2 , (3.2) 

y% ~MF(pi,K) , Pi = ,° X \ , i = 1 , • • ■ ,n, 

Ip ° Xi\ 

k some fixed known value. 


As an example of what the data looks like, we generate one thousand (n = 1000) observations 
from the above model with k = 10 so that realizations are near their expected value. Figure |T] 
shows this example in which 100 predictions from the extrinsic model are plotted against their true 
values using 900 training points. To select the bandwidth h we use 10-fold cross-validation with 
h ranging from [.1, .2,..., 1.9,2] and choose the value that gives minimum average mean square 
error. Residuals for the mean square error are measured using the intrinsic distance, or great circle 
distance, on the sphere. 

To illustrate the utility and advantages of extrinsic regression models, we compare our method to an 
intrinsic kernel regression model that uses intrinsic distance of the sphere to minimize the objective 
function. Computations on the sphere are in general not as intensive compared to more complicated 
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Figure 1. Left The training values on the sphere. Middle The held out values to 
be predicted through extrinsic regression. Right The extrinsic predictions (blue) 
plotted against the true values (red). 


manifolds such as shape spaces, etc, but it still requires an iterative algorithm, such as gradient 
descent, for the intrinsic model in order to obtain a kernel regression estimate. The following 
simulation results demonstrate extrinsic kernel regression gives at least as accurate estimates as 
intrinsic kernel regression but in much less computation time even for S 2 . 


Comparison with an intrinsic kernel regression model: The intrinsic kernel regression esti¬ 
mate minimizes the objective function f(y) = w id 2 (y, Vi), where y and j/j are points on the 

sphere S' 2 , Wi are determined by the Gaussian kernel function, and d(-, •) in this case is the greater 
circle distance. Then the gradient of / on the sphere is given by 


V/(y) = ^2wi2d(y,yi) 
2=1 


lOg yiVi) 

d{y,yi) 


y 2 Wi 

2=1 


arccos (y T yj) ( 


(y T yi)y), 


where \og y (yi) is the log map or the inverse exponential map on the sphere. Estimates for y can be 
obtained through a gradient descent algorithm with step size S and error threshold e. We applied 
the intrinsic and extrinsic models to the same set of data using the Gaussian kernel function. 


Twenty different data sets of 2000 observations were generated from the above sphere regression 
model with von-Mises Fisher concentration parameter k = {1,2,..., 20}. Of the 2000 observations, 
50 were used to check the accuracy of the extrinsic and intrinsic estimates. To see the effect of 
training sample size on the quality of the estimates, the estimates were also made on subsets of 
the 1950 training observations, starting with 2 observations and increasing to all 1950 observations. 
The same training observations were always used for both models. In both models, the bandwidth 
was chosen through 10-fold cross validation. The intrinsic kernel regression was fit with step size 
S = .01 and error threshold e = .001. The performance of the two models are compared in terms 
of MSE and predictive MSE. The MSE is calculated using the greater circle distance between 
predicted values and the true expected value, while predictive MSE is calculated using the greater 
circle distance between the predicted values and the realized values. The performance results using 
50 hold out observations can be seen in Figure [2j 


Predictive MSE does not converge to 0 because the generating distribution has a high variance; 
however, as the concentration increases, the predictive MSE does approach 0. The extrinsic and 
intrinsic kernel regressions perform similarly with large sample sizes. The extrinsic kernel regression 
drops in predictive MSE faster than the intrinsic model, which may stem from only having the kernel 
bandwidth as a tuning parameter which can be selected more easily than choosing the bandwidth, 
step-size, and error thresholds even through cross-validation. 


A significant advantage of the extrinsic kernel regression is the speed of computation. Both methods 
were implemented in C++ using Repp (Eddelbuettel and Frangois 2011), and resulted in up to a 
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Figure 2. The performance of extrinsic and intrinsic regression models on 50 test 
observations from sphere regression models with concentration parameters from 1 
to 20. Each color corresponds to a concentration parameter. The extrinsic and 
intrinsic models have similar performance in predictive MSE with low concentration 
parameters. However in terms of MSE, the extrinsic model appears to perform 
better with lower sample sizes even with lower concentration parameters. 


60x improvement in speed in making a single prediction using all of the training observations. For 
speed comparisons, a single prediction was made given the same number of test observations, and 
the time to produce the estimate was recorded. Each of these trials was done five times, and we 
compare the mean time to producing the estimate in Figure [3] 

Note that the same kernel weights are computed in both algorithms, so the difference is attributable 
to the gradient descent versus extrinsic optimization procedures. Since the speed comparisons were 
done for computing a single prediction and the difference is due almost entirely to the gradient 
descent steps, making multiple predictions results in an even more favorable comparison for the 
extrinsic model. This experiment shows that the extrinsic kernel regression applied to sphere data 
performs at least as well on prediction and can be computed significantly faster. 


Example 3.2. We now consider an example with planar shape responses. Planar shapes are 
one of the most important classes of landmark based shapes spaces. Such spaces were defined 
by Kendall (1977) and Kendall (19841 with pioneering work by |BooksteIn (1978) motivated from 
applications on biological shapes. We now describe the geometry of the space which will be used 
in obtaining regression estimates for our model. Let 2 = (z±, ..., Zk) with Z\,... ,Zk £ M 2 be a set 


of k landmarks. Let < z >= (z,...,z) where z = Y'jL, Zi/k. Denote u = --- which can 

Ik- < z > II 

be viewed as an element on the sphere S 2k ~ 3 , which is called the pre-shape. The planar shape T, k 
can now be represented as the quotient of the pre-shape under the group action by SO( 2), the 2 
by 2 special orthogonal group. That is, E k = S 2k ~ 2 ~ x /SO{ 2). can be shown to be equivalent 

to the complex projective space CP fe_2 . Therefore, a point on the planar shape can be identified 
as the orbit or equivalent of z which we denote by cr(z). Viewing z as elements in the complex 
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Figure 3. Speed comparisons between the extrinsic and intrinsic kernel regres¬ 
sions as a function of the number of training observations. The average seconds to 
produce an estimate for a single test observation are plotted in red for the intrinsic 
model, and black for the extrinsic model. The multiple between the speed for the 
intrinsic and extrinsic estimates plotted are also plotted for reference. 


plane, one can embed E \ onto the S(k,C), the space of k x k complex Hermitian matrices via 
the Veronese-Whitney embedding (see |Bhattacharya and Patrangenaru (2005), Bhattacharya and 


Bhattacharya (2012)): 


J(a(z)) = uu* = ((uiUj))i< t i t j<k- 


(3.3) 


One can verify the Veronese-Whitney embedding is equivariant (see Kendall (1984)) by taking the 
Lie group G to be special unitary group SU(k) with 

SU(k) = {Ag GL{k,C),AA* = I,det(A) = I}. 

The action is on the left, 

Aa(z) = a(Az). 

The homomorphism map <f> is taken to be 

</>: S(k,C) -> S{k, C) : cj){A)A = AAA*. 

Therefore, one has 

J(Acr(z)) = Auu* A* = cf>(A)J(a(z)). 































































































We now describe the projection after F(x) is given by (2.3), where J(yi) (i = 1,... ,n) are obtained 
using the equivariant embedding given in (3.31. Letting v T be the eigenvector corresponding to 
largest eigenvalue of F(x), by a careful calculation, one can show that the projection of F(x) is 
given by 

'Pj(M) = v T v. 

Therefore, the extrinsic kernel regression estimate is given by 

F e {x) = J-\v T v). (3.4) 


Corpus Callosum (CC) data set: We study ADHD-200 dataset 0 in which the shape contour 
of the brain Corpus Callosum are recorded for each subject along with variables such as gender, 
age, and ADHD diagnosis. The subjects consist of patients who are diagnosed with ADHD. 50 
landmarks were placed outlining the CC shape for 647 patients for the ADHD-200 dataset. The 
age of the patients range from 7 to 21 years old, with 404 typically developing children and 243 
individuals diagnosed with some form of ADHD. The original data set differentiates between types 
of ADHD diagnoses, and we simplify the problem of choosing a kernel by using a binary response 
for an ADHD diagnosis. 


According to the findings in Huang et al. (2015), there is not a significant effect of gender on 
the area of different segments of the CC; however diagnosis and the interaction between diagnosis 
and age were found to be statistically significant (p < .01). With knowledge of these results, we 
performed the extrinsic kernel regression method for the CC planar shape response using diagnosis, 
x 1 , and age, x 2 , for covariates. The choice of kernel between two sets of covariates X\ = (x\,xf) 
and X 2 = ( x \, x 2 ) is 

/h 2 if x\ = x\ 


K h (x 1 ,x 2 ) = 


(exp ( 


lo 



if x{ ^ x\. 


Although Huang et al. (2015) explores clustering the shape by specific diagnosis, we visualize 
how the CC shape develops over time by making predictions at different time points. We show 
predictions for ages 9, 12, 16, and 19 year old children of ADHD diagnosis or typical development. 
The results can be seen in Figure |4j 


What we can observe from the two plots is that the CC shapes for the 8 year olds seem to be close, 
but by age 12 the shapes have diverged substantially, with shrinking of the CC being apparent in 
later years in development. This quality of the CC shapes between ADHD and normal development 


is consistent with results found in the literature (Huang et al. 2015). 


In previous studies, ADHD diagnoses were clustered using the shape information to predict the 


diagnosis class, and the centroid of the cluster is the predicted shape for that class (Huang et al. 


2015). Our method adds to this analysis by taking the diagnosis and predicting the CC shape as 


a function of age. Our method also has the benefit of evaluating quickly, making selection of the 
bandwidth for the kernel through cross-validation feasible. 


Example 3.3. We now consider another two classes of important manifolds, Stiefel manifolds 
and Grassman manifolds (Grassmannians). The Stiefel manifold, 14 (K m ), is the collection of k 
orthonormal frames in K m . That is, the Stiefel manifold consists of the set of ordered fc-tuples of 
orthonormal vectors in R m , which can be represented as {A' £ S(m,k), XX T = I m }. The Stiefel 
manifold includes the to dimensional sphere S m as a special case with k =1 and O(m) the orthogonal 
group when k = to. Examples of data on the Stiefel manifold include the orbit of the comets and 
the vector cardiogram. Applications of Stiefel manifold are present in earth sciences, medicine, 
astronomy, meteorology and biology. The Stiefel manifold is a compact manifold of dimension 
km — k — k(k — l)/2 and it is a submanifold of R fcm . The inclusion map can be further shown to 
be an equivariant embedding with the Lie group taken to the orthogonal group O(m). 


^http^/fcon.lOOO. projects, nitrc.org/indi/adhd200/ 
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CC Shape age 9 


CC Shape Age 12 



CC Shape age 16 




CC Shape age 19 



Figure 4. Predicted CC shape for children ages 9, 12, 16, and 19. The black shape 
corresponds to typically developing children, while the red shape corresponds to 
children diagnosed with ADHD. Kernel regression allows us to visualize how CC 
shape changes through development. Here sections of CC appear smaller in ADHD 
diagnoses than in normal development. 


Given F(x ) obtained by kernel regression after embedding the points yi,...,y n on the Stiefel 
manifold to the Euclidean space R fcm , the next step is to obtain the projection of F(x) onto 
M = J(M). We first make an orthogonal decomposition of F(x) by letting F(x) = US, where 
U £ Vk,m , which can be viewed as the orientation of F{x) and S is positive semi-definite, which 
has the same rank as F(x). Then the projection of F(x) (or projection set) is given by 

Vm(F{x)) = {U £ V k , m : F(x) = H(%) T %)) 1/2 }. 


See Theorem 10.2 in Bhattacharya and Bhattacharya (2012) for a proof of the results. Then the 
projection is unique, that is, the above set is a singleton if and only if F{x) is of full rank. 


The Grassmann manifold or the Grassmannian Gr k (M. m ) is the space of all the subspaces of a fixed 
dimension k whose basis elements are vectors in R m , which is closely related to the Stiefel manifold 
Vk.m- Recall a subspace can be viewed as the span of an orthonormal basis. Let v = {iq,..., v k } 
be such an orthonormal basis for a subspace on the Grassmannian. Note that the order of the 
vector does not matter unlike in the case of Stiefel manifold. For any two elements on the Stiefel 
manifold whose span corresponds to the same subspace, there exists an orthogonal transformation 
(mapped by a orthogonal matrix in 0(h)) between the two orthonormal frames. These two points 
will be identified as the same point on the Grassman manifold. Therefore, the Grassmannian can be 
viewed as the collection of the equivalent classes on the Stiefel manifold, i.e., a quotient space under 
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the group action of O(k), the k by k orthogonal group. Then one has Grfc(R m ) = L4(M m )/0(k). 
There are many applications of Grassmann manifolds, in which the subspaces are the basic element 
in signal processing, machine learning and so on. 


The equivariant embedding for Grfc(R m ) also exists (Chikuse 2003). Let X £ 14 ,m be a represen¬ 
tative element of the equivalent classes in Grfc(R m ) = 14(R m )/0(fc). So an element in the quotient 
space can be represented by the orbit a(X ) = XR where R £ 0(k). Then an embedding can be 
given by 

J(a(X)) = XX T . 


The collection of XX T forms a subspace of . We now verify that J is an equivariant embedding 
under the group action of G = O(m). Letting g £ G = 0(m), one has J(gX ) = gXX T g T = 
0(g)J(X), where the map 0(g) = g acts on the image J(X) by the conjugation map. That is, 
0(g)J(X)=gXX T g T . 


Given the estimate F(x ), the next step is to derive the projection of F(x ) onto M = J(M). Since 
all XX T form a subspace, one can use the following procedure to calculate the map from F(x) to 
the Grassmann manifold by finding an orthonormal basis for the image. This algorithm is a special 


case of the projection via Conway embedding (St. Thomas et al. 2014). 


(1) Find the eigendecomposition F(x) = QAQ 


(2) Take the k eigenvectors corresponding to the top k eigenvalues in A as an orthonormal basis 
for F e (x ) 1 <2[l :fc; ]. 

We now consider two illustrative examples, one synthetic and one from a financial time series, for 
extrinsic kernel regression with subspace response variables. The technique is unique compared 
to other subspace regression techniques because the extrinsic distance offers a well defined and 
principled distance between responses of different dimension. This prevents having to constrain the 
responses to be a fixed dimension or hard coding a heuristic distance between subspaces of different 
dimension into the distance function. 


We now consider a synthetic example in which the predictors are the time points and the responses 
are points on the Grassmann manifold. Since we represent subspaces with draws from the Stiefel 
manifold, we draw orthonormal bases from the Matrix von Mises-Fisher distribution as their rep¬ 
resentation. We generate N draws from the following process with concentration parameter n, in 
which the first ni draws are of dimension 4 and the last 712 draws are of dimension 5, 
for 1 < t < N do 

Draw X - MN(0,I m ,I 5 ) 

M[,i] : = t + ^[,i], M[,2] := t — X{ t 2], Ml,3] := t 2 + X[ |3 ], M[,4] := tX^§ 

if t > ni then 

M[,5] '-—t + tA'[ ]5 ] 

end if 

Y t := vMF(kM) 

end for 


Here the only covariate associated with Y t is t. With a concentration of k = 1, and ni = 77.2 = 50, 
we generate much noisier data than before, and are able to correctly predict the dimension of 
the subspace at each time point. When examining the pairwise distance between the realizations 
in Figure [5j it is clear that the extrinsic distance distinguishes between dimensions and does not 
require any specification of the dimension. The predicted dimension at each time point and the 
residuals are plotted in Figure [6] 

The key advantage of this method is not requiring any constraints on the dimension of the input or 
output subspaces. This is important in some examples, such as high dimensional time series analysis 
with data such as high frequency trading where the analysis usually culminates in analyzing prin¬ 
cipal components, or eigenvectors of the large covariance matrix estimated between assets. Market 


li 








Distance Between Subspaces 



Figure 5. Pairwise distance between two observations generated by the specified 
model indexed by t measured by distance between points in the Conway embedding. 
This visualization of the extrinsic distance shows the cluster by dimension. 


events can change asset covariances which in turn changes the number of significant eigenvectors, 
so a method automatically interpolating time points must not depend on specifying the number of 
significant eigenvectors. 


We apply this method to the Istanbul Stock Exchange on UCI Machine Learning Repository |Akbil- 


gic et al. {2013), using the 5 index funds S&P 500 (SP), the Istanbul Stock Exchange (ISE), stock 


market return index of Japan (NIKKEI), MSCI European index (EU), and the stock market return 
index of Brazil (BOVESPA). The data contain 97 full weeks over 115 weeks of daily market closing 
values from January 5, 2009 to February 18, 2011. For each week, a covariance matrix is estimated 
between the assets, of which the eigenvectors with eigenvalues greater than 10 -10 are retained as 
the orthonormal basis corresponding to the covariance matrix. As can be seen in Figure [TJ these 
covariance matrices change significantly over time. These orthonormal matrices are given to the 
model along with the corresponding week and each week is predicted. 


The residuals are shown in Figure [8] compared to the observed distance between subspaces between 
two consecutive observations. The method is predicting the subspaces within the variance of data, 
which means there is information or at least structure to how the covariance matrices are evolving 
over time - the relationships are not purely random week to week. 


4. Asymptotic properties of the extrinsic regression model 

In this section, we investigate the large sample properties of our extrinsic regression estimates. We 
assume the marginal density fx(x) is differentiable and the absolute value of any of the partial 
derivatives of fx{x) of order two are bounded by some constant C. In our proof, we assume 
our kernel function K takes a product form. That is, K(x) = Ki(x 1 ) ■ ■ ■ K m (x rn ) where x = 
(x 1 ,...,x m ) and K ±,..., K m are one dimensional symmetric kernels such that f R Ki(u)du = 1, 
f R uKi(u)du = 0 and f R u 2 Ki(u)du < oo for i = 1 ,...,m. The results can be generalized to 
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Figure 6 . The estimated dimension and residual for the extrinsic kernel regression 
estimate at each time point t from data generated from the specified model. The 
regression estimate is accurate on the dimension of the subspace and prediction 
residuals are consistent with a concentration parameter k = 1. 


kernels with arbitrary form and with H given by a more general positive definite matrix instead 
of a diagonal matrix. Theorem |4.1| derives the asymptotic distribution of the extrinsic regression 
estimate Fe(x ) for any x. 


Theorem 4.1. Let ix(x) = E (p(dy\x)^J, which is the conditional mean regression function of P 
and assume /i( x) is differentiable. Assume n\H\ —> oo. Denote x = (x 1 ,..., x m ). Let Jl(x) = 


fi(x) 


z (x) 

fx(x) 


where the ith component Zfx) (i = 1,..., D) of Z(x) is given by 


df d/M 1 


=*, ( ajrjji + jH>) 


d Vi 


sv 1 ) 2 


d 2 m 

dx m x 1 


vlKifvfjdvx 


dx m dx m + 2 JX[X> 


d 2 g.i 

dx 1 x r 


_d 2 jM_ 

d(x m y 


C m Ejn { ‘C r n ) d'U TI 


(4.1) 


Assume the projection P of fi(x) onto M = J(M) is unique and P is continuously differentiable 
in a neighborhood ofjl(x). Then the following holds assuming P(dy \ x) o J -1 has finite second 
moments: 


V n \ H \dji(x)P (f{x) ~ W) ^ AT( 0 ,E(x)), 
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Week 14 


Week 27 


1 0.00055 
0.00050 
0.00045 
0.00040 
■ - 0.00035 
- 0.00030 



ISE SP NIKKEI BOVESPA EU 


1 0.00020 

0.00015 

0.00010 

! - 0.00005 

- 0.00000 

- -0.00005 

ISE SP NIKKEI BOVESPA EU 



Week 46 


Week 78 


ISE - 



NIKKEI - 

BOVESPA - 

EU - 



1 0.00012 
0.00010 

n- 0.00008 

- 0.00006 

- 0.00004 

- 0.00002 

- 0.00000 



1 0.00018 
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0.00014 
0.00012 
0.00010 
- 0.00008 
- 0.00006 
- 0.00004 
- 0.00002 
_ 1 - 0.00000 


Figure 7. The covariance matrices estimated from the daily closing prices from 
various stock market index funds over various weeks. Covariance between markets 
change substantially from week to week. 
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Lag-1 Deviations 
Residuals 


Figure 8 . The distribution of residuals from extrinsic kernel regression compared 
to the distribution of the distance between observations. The in general smaller 
residuals suggest that the changes in covariance structure is dynamic in a learnable 
way. 

where dp( x )V is the differential from T^ X )R D to T-p^ x ^M of the projection map V at Ji(x) = 
/i( x) + ■ Here £(:c) = B t T,(x)B, where B is the D x d matrix of the differential dp( x fP with 
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respect to given orthonormal bases of T^^M. D and Tpip,( x \)M, and the ( j,k)th entry of Tj{x) is 
given by (5.13) with 


Sjfc — 


a (Jj(y)i Jk{y)) J K(v) 2 dv 
fx(x) 


(4.3) 


where o(Jj(y), Jk{y)) = Cov(Jj, Jk), and Jj is the jth element of J(y). Here indicates conver¬ 
gence in distribution. 


Corollary |4.2| is on the mean integrated squared error of the estimates. 

Corollary 4.2. Assuming the same conditions of Theorem \f. 1\ and the covariate space is bounded, 
the mean integrated squared error of Fe(x) is of the order 0(n~ 4 ^ m+4 ' > ), with the choice of ht’s 
(i = 1,... ,m) to be of the same order, that is, of 0{n^ 1 ^ m+4 ' > ). 

Remark 4.1. Note that in nonparametric regression with both predictors (m-dimensional) and 
responses in the Euclidean space, the optimal order of the mean integrated squared error is 
0(n -4 /( m+4 )) under the assumption that the true regression function has bounded second de¬ 
rivative. Our method achieves the same rates. However, whether such rates are minimax in the 
context of manifold valued response is not known. 


Theorem |4.3| shows some results on uniform convergence rates of the estimator. 

Theorem 4.3. Assume the covariate space x € X C is compact and V has continuous first 
derivative. Then 

sup \\d Kx) V [f{x) - E(F(x))^j || = Op (log 1/2 n/y/n\H\^ . (4.4) 


As pointed out in Remark 1 2.3 1 it is ideal in many cases to fit a higher order (say pth order) local 
polynomial model in estimating p(x) before projecting back onto the image of the manifold. Such 
estimates are more appealing especially when F{x) is more curved over a neighborhood of x. One 
can show that similar results as those of Theorem |4.1| hold, though with much more involved 
argument. 


We now give details of such estimators and their asymptotic distributions are derived in Theorem 
Recall F(x) = E (P ( dy \ x)) and p(x) = E ^ P(dy | a:)^ and J(y n ) are the points 


4.4 


on M = J(M ) after embedding J. We first obtain an estimate F(x) of p(x) using pth order local 
polynomials estimation. The intermediate estimate F(x) is then projected back to M serving as 
the ultimate estimate of F(x). The general framework is given as follows: 


{^k( x )}o<\k\<p, 1 <j<D 
n 

= argrnin ^ (ll-%*) “ ( Pk( x )( x i ~ X ) W > • • -. Pk { x ){ x i ~ X ) W ) 

{P J k ( x )}o<\k\<p, i<j<D i—1 0<|fe|<p 0<|k|<;p 

x K H (xi - a;)). 

Some of the notation used in ( |4.5| ) are given as follows: 

m 

k = (fci,..., km), \k\ = ^2 ki, |fc| S {0,... ,p}, 

1=1 

k\ = ki\ x ... x k m \, x k = (x 1 )* 1 x ... x {x m ) k ™ 

P j 3 

E =EE-E- 

0<|fc|<p i=0fei=0 krn—O 
\k\=k 1 + ...+k rn =j 


(4.5) 

T || 2 

(4.6) 
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When k= 0, ^/3q, ■ ■ ■, /3q^J corresponds to the kernel estimator, which is the same as the estimator 
given in (2.3). When p = 1, ^/3fc =0 , ■ ■ • ,/3*? =0 ) coincides with the estimator /3 0 in (2.6). 

Finally, we have 


F(x) = A)0) = (fk= o. • ■ •. Pk= o) . 

F E (x) = J _1 (v{F{x)) s j = J _1 ( argmin||q — F(x) 


(4.7) 

(4.8) 


q£ M 


Theorem 4.4 derives the asymptotic distribution of F E (x), with F(x) obtained using pth order 


polynomials local regression of J{y i),..., J(y n ) given in (4.7). 


Theorem 4.4. Let F E (x ) be given in ( |4.8| ). Assume the (p + 2 )th moment of the kernel function 
K(x) exists and y{x) is (p+2)th order differentiable in a neighborhood of x = (ar,... ,x m ). Assume 
the projection V of p(x) onto M = J(M) is unique and V is continuously differentiable in a 
neighborhood ofjl(x), where Ji(x) = y(x) + Bias(x), with Bias(x) given in (5.23). If P(dy | i)oJ _1 
has finite second moments, then we have: 


Vff\H\d Kx) V (f{x) - p{x)^j A N(0 , S(x)), 


(4.9) 


where d^ x )V is the differential from T^ X ^ D to of the projection map V at p(x). Here 

E(x) = B T 'E(x)B, where B is the D x d matrix of the differential dm x fP with respect to given 
orthonormal basis of tangent space T^ X ^ D and tangent space Tp^ x ^M and the jkth entry of S(x) 
is given by (5.26). Here indicates convergence in distribution. 


Remark 4.2. Note that the order of the bias term Bias(x) (given in (5.23)) differs when p is even 
(see (5.21)) and when p is odd (see (5.22)). 


5. Conclusion 

We have proposed an extrinsic regression framework for modeling data with manifold valued re¬ 
sponses and shown desirable asymptotic properties of the resulting estimators. We applied this 
framework to a variety of applications, such as responses restricted to the sphere, shape spaces, and 
linear subspaces. The principle motivating this framework is that kernel regression and Riemannian 
geometry both rely on locally Euclidean structures. This property allows us to construct inexpen¬ 
sive estimators without loss of predictive accuracy as demonstrated by the asymptotic behavior of 
the mean integrated square error, and also the empirical results. Empirical results even suggest 
that the extrinsic estimators may perform better due to their reduced complexity and ease of op¬ 
timizing tuning parameters such as kernel bandwidth. Future work may also use this principle to 
guide sampling methodology when trying to sample parameters from a manifold or optimizing an 
EM-algorithm, where it may be computationally or mathematically difficult to restrict intermediate 
steps to the manifold. 


Appendix 


F(x) = 


ElLi J{yi)K H (xj-x) 
K H (xi-x) ' 
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Proof of Theorem | f.l\ Recall 


l 

n 













Denote the denominator of F(x) as 


It is standard to show 


-i n i n 

f ( X ) = ~^2 K H ( X i- X) = | rr I ^2 K ( X i ~ X )- 


f{x) —> fx(x) 


where —> indicates convergence in probability. For the numerator term of F{x), one has 

(l n \ i n 

E y']j(y i )K H (x i -x)) = - y m ]E(J(y i )K H (x i - x))) 

\ n / n 

\ 2=1 / 2=1 


1 n r 

= E ( J(yi) K H(xi - a:)) | Xi) f x (xi)dxi 

2=1 J 
1 n r 

= ~E Kxi)K H {xi - x))f x {xi)dxi 

71 i-1 •* 

= J h(x)K h (x - x))f x {x)dx. 


(5.1) 


Noting that y(x) = {y i(x),..., y d{x ))' G R 15 , we slightly abuse the integral notation above meaning 
that the jth entry of E ( n~ l 1 J(yi)K H (xi — x))) is given by 

f Hj(x)K H (x - x))f x {x)dx. 


Letting v = H x (x — x) by changing of variables, the above equations become 

E ^ y]j(yi)K H (xi - x))j = Jn(x + Hv)K(v)f x (x + Hv)dv. 

By the multivariate Taylor expansion, 

fx(x + Hv) = f x (x) + (v/) • (Hv) + R , (5.2) 

where \/f is the gradient of / and R is the remainder term of the expansion. The remainder R can 
be shown to be bounded above by 

R < ^\\Hv\\ 2 , ||-ffu|| = \h\V\\ + ... \h m v m \. 

Note that y(x + Hv) is a multivariate map valued in $L D . We can make second order multivariate 
Taylor expansions for y(x + Hv) = (yi(x + Hv),... ,hd{x + Hv))' at each of its entries /q for 
i = 1,..., D. We have 


y(x + Hv) = y(x) + A(Hv) + V + R, (5-3) 

where A is a D x m matrix whose ith row is given by the gradient of /q evaluated at x. V is a 
D-dimensional vector, whose itli term is given by |( Hv) f Ti(Hv ), where X) is the Hessian matrix of 
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(5.4) 


Hi{x) and R is the remainder vector. Thus, 
E ( - y ^J(yi)K H (xj - x)) J 


i= 1 


« J ( {fx{x ) + (v/) • (Hv))K(v)(fj,(x) + A(Hv) + V)) dv 

= f x {x)n{x) + fx(x) J K(v)A(Hv)dv + f x (x) J K(v)Vdv (5.5) 

+ fi{x) J (v/) ' (Hv)K(v)dv + J (v/) ’ {Hv)K(v)A(Hv)dv + J (v/) ' ( Hv)K(v)Vdv. (5.6) 


By the property of the kernel function, we have f K(u)udu = 0; therefore the second term of 
equation (|5.5|) is zero by simple algebra. To evaluate the third term of equation (5.51, we first 


calculate for J K(y)Vdv. From here onward until the end of the proof, we denote x = (ad,..., x m ) 
where x l is the ith coordinate of x. Note that the ith term of V (i = 1 ,...,£)) is given by 

-(HvyTi(Hv), where T, is the Hessian matrix of /Xj, which is precisely 


2 klVl ^(x 1 ) 2 


+ ...+ 


8 2 im 


1 


+ + —h 2 v 2 

dx m x 1 J 2 m m V dx 1 x™ 


( d 2 m 


d 2 m 

d(x m )'- 


Therefore, the xth entry of the third term of equation ( |5.5| ) is given by 

d 2 /x ; 


U, = \fx (*)().? +... + ) / «?*i («.)*. 


(5.7) 


+ K 


JPjM 

dx 1 x r 


d 2 m 

d(x m y 


V-rn, E m {Xm ) dV-p 


The first term of equation (5.6) is given by 


li{x) J (y/) ■ (Hv)K(v)dv = J + ■.. + K(v)dv = 0. 


The xth entry of the second term of equation (5.6) is given by 


u 2 9f d/J,i f 2 js ( , , ,2 d f d m f 2 TS t ^ 7 tt , 

kl dxL d^J VlRl ^ dVl+ --- + hm d^d^ J V mEm(.V m )dv m . (5.8) 


The third term of equation (5.6) can be shown to be zero, since odd moments of symmetric kernels 
are 0. Therefore, we have 


E 'Y^J(y i )K H {x i - x))J « f x {x)n(x) + Z , 
where the xth coordinate of Z is 

Zi =/ll { —T W-T + ~ tx(X ) | 77-+ ■ • • + 


(5.9) 


,2 (dfd^,l f d 2 Hi 

" ^dx 1 dx 1 + 2 /x(x) (^(x 1 ) 2 


, , d 2 Hj 

dx m x 1 


v{K 1 {v 1 )dv 1 


df dm 
dx m dx m 


lfx(x) 


( d 2 ^ 
\dx 1 x rn 


d 2 m 

d(x m )‘‘ 


X rn Rm('Um')dll n 


(5.10) 


combining equations (5.7) and (5.8). The reminder term of (5.2) is of order o(max{/ii,..., h m }) 


and each entry of the remainder vector in (5.3) is of order o(max{/i 2 ,..., h^}). 


We now look at the covariance matrix of n 1 X]"=i J(Ui)KH{xi — x)), which we denote by E(x). 
Denote the jth entry (j = 1,...,D) of J(yi) as Jj{yi)- Denote fx(j/- 7 , y k ) as the conditional covariance 
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between the ith entry and jth entry of y. We have 
Sjfc = E 


1 V—-V / 1 

- 55 Jj(Vi)K H {xi - x)) - E I - 55 Jj(yi)K H (xi - a;)) 

71 i—1 \ i=l / 


- 55 Jk{yi)K H {xi — x)) — E i — 55 Jk{yi)K H (xi - x)) 


i=1 


i=1 


= E 


i=i 


55 ( J AVi) K H{Xi - x)) - / Hj(x)Kff(x - x)f x (x)dx 


— 55 ( J k(yi) K H{Xi - x)) - / Hk{x)K H {X - x)fx{x)<E 


= ~ / E 
n . 


J j (yi)K H (x 1 - x)) - / Hj(x)K H (x - x)f x (x)dx 


Jk{yi)K H (xi-x)) - J n k (x)K H (x- x)f x (x)dxj \ x\ f x (xi)dxi 
= ^ J cr{J j (y 1 )K H (x 1 - x)), Jk(yi)K H (xi - x))f x (x 1 )dx 1 
= ^ J K h (x 1 - x)) 2 a(Jj(yi), Jk{y\))fx{x\)dx 1 . 

By the change of variable v = H~ l (x\ — x), the above equation becomes 


Sjfc — 


nH 


K{v) 2 a(Jj(y v ), Jk(y v ))fx(Hv + x)dv 


E J K(v) 2 a {Jj(yv),J k (y v )) ifx(x) + v/ • (Hv) + o(max{/u,.. .,h m }))dv 


nH 


K{v) 2 a(Jj(y v ), Jk(y v ))fx{x))dv + o 


nH 


(5.11) 


By (5.1), (5.9) and (5.24), and applying central limit theorem and Slustky’s theorem, one has 

y/n\H\ (E(x) ~ ?(®)) A JV(0, £(*)), (5.12) 


where y(x) = y(x) + y and the zth entry (i = 1,..., D) of Z is given by (5.10) and 


Sjfc — 


v{Jj(yv),Jk{yv)) J K(v) 2 dv 

fx(x) 


(5.13) 


One can show 

V n \H\ (Fe{x) - V (ji(x))j = \/n\H\dji( x )V (f(x) - y( x )) + o P (l). 

Therefore, one has 

\fn\H\dy,{ x) V (f(x) - y(x)^j A N( 0, E(x)). (5.14) 

Here £(x) = B T Y,(x)B , where B is the D x d matrix of the differential <%( X )’P with respect to given 
orthonormal bases of 2]j( x )R D and Tp-jxi x )M . 

□ 

Proof of Corollary \f.S\ In choosing the optimal order of bandwidth, one can consider choosing 
(hi, ..., h m ) such that the mean integrated squared error is minimized. Note that 


F e (x) - F(x) = Jacob('P) ju(x ) (f(x) - n(x)j + o p ( 1). 


(5.15) 
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Here Jacob('P) is the Jacobian matrix of the projection map V . One has 

MISE(F e (x)) = J E\\F e (x) - F{x)\\ 2 dx 

= J EpacobiV)^) ( F(x ) - (i{x )) + o p (l)|| 2 cte 



= 0(l/n\H\) + ... + 0(l/n\H\) + 0(h\) + ... + 0(h 4 J. 


The last terms follow from Fatou’s lemma, and that the Jacobian map is differentiable at for 

every x. Therefore, if h^s (i = 1,..., m) are taken to be of the same order, that is, of 

then one can obtain MISE(i r E(a;)) with an order of 0(n~ 4 ^ m+4 ^). □ 


Proof of Theorem 4.3 Let B be the D x d matrix of the differential d^ x )V with respect to given 
orthonormal basis of tangent space TjxmM. d and tangent space T-p-p,i x )M. Given a canonical choice 
of basis for tangent space one has the representation for 


sup \\dft x) V (f(x) - E(F(x)j) || = sup 

X ' ' X, 



d ( D 

)H=sup 

E £ s «( 

\ 

i=i V =1 


(5.16) 


Note that the projection map is differentiable around the neighborhood of n(x) and X is compact, 
so Bjj(x) are bounded. Let = sup xeX (Bfj) 2 (x) and C = maxCy. For each term note that, by 
Cauchy-Schwarz inequality, 

2 

< sup^(H^) 2 (Fj{x) - E (Fj (a;)} ' 


/ D 

SUP 51 ( B h) “ E 

xGX > - ' v ' 


G =1 


J=i 

D 


< 


C V sup (Fj( x ) - E (e,-(x)) 
r—; xex K K > 


i =i 


By Theorem 2 in Hansen (2008), one can see that 

sup | (Fj(x) - E (Fj(x)^ | = 0(r n ), 

where r n = log 1 / 2 n/\Jn\H\. Then one has 

;E f E ( B H - E (^(*)))) I = 0(r 2 ). 


sup 

xex 


i=l \j=1 


Then one has 

sup \\d^ x )V (f(x) - E(F{x))) II = sup 


d D 


^ E ( E ( Bl i (Ej( x ) E ( Ej(x) 


=i \ j —i 


0 (r n ) =o(log 1 / 2 n/V^H\) ■ 


(5.17) 

(5.18) 

(5.19) 

(5.20) 


□ 


Proof of Theorem 4-4 Given the higher order smoothness assumption on [i(x ), one can make higher 


order approximations and using a local polynomials regression estimate would result in the reduction 
of bias term in estimating /r( x ). The asymptotic distribution for multivariate local regression 
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estimator for Euclidean responses has been derived (Gu et al. 2014 Ruppert and Wand 1994 


Masry 19961, and we leverage on some of their results in our proof. 

Note that F(x) = ^Fi(x),..., Fd(x)^J G R D . E(F(x)) = ^E(Fi(x)), ... ,E(Fd(x))J and the 

expectation taken in each component is with respect to the marginal distribution of P(dy \x). Then 
by Theorem 1 of Gu et al. (20141, the following holds: 


(1) If p is odd, then for j = 1,... . D 

Bias j(F(x)) = E(Fj(x)) — Pj(x) 

= [M- l B p+1 H^rrv’ p+1 {x)) ^ (5.21) 

which is of order 0(||/i|| p+1 ). Here (-)i represents the first entry of the vector inside the 
parenthesis; 


(2) If p is even, then for j = 1,.... D 

Bias j(F(x)) = E(Fj(x)) — p.j{x) (5.22) 

E hl Pr\ { M v lB v +1 - M~ 1 A4 l p M- 1 B p+1 ) JL (p+1 W p+1 (z) + M^B^H^m^x) 
i=i J x y x ) 

which is of order 0(||fa|| p+2 ). 

For any k G {0,1,..., p}. Let N k = (E-i*) an d = J2k=o N k- Here M p is a Af p x Af p 
matrix whose (i,j)th block (0 < i,j < p) is given by J Rm u l+ i K(u)du and M l p (l = 1 ,...,m) 
is a Afp x Af p matrix whose (i,j )th block (0 < i,j < p ) is given by f Rm uiU z+J K(u)du. B p+1 is 
a Afp x N p+ 1 matrix whose (i,p + l)th (i = 1,... ,p) block is given by J Rm u l+p+1 K(u)du and 
B l p +i {l = 1 ,m) is a Af p x N p+ 1 matrix whose (i,p + l)th (i = 1 ,... ,p) block is given by 

/ Rm uiu l+p+1 K(u)du. We have H (p+1) = Diag {h\ +1 ,... ,h p + 1 }. fi(x) = an d rn p+ i( x ) 

(,j = 1 ,...,£>) is the vector of all the p + 1 order partial derivative of yj(x ), that is, m 3 pJrl {x) = 
dfi p+1 (x) dp p+1 (x) \ 

d{x 1 ) p+1 ' d(x 1 ) p d(x 2 ) ’ ’ d(x m ) p+1 J 

With BiaSj(F(a:)) (j = 1,D) given above, one has 

Bias(x) = E(F(x))) — p{x) = ^Biasi(F(a;)),..., Bias_o(F(a;))^ . (5.23) 




Although higher order polynomial regression results in the reduction in the order of bias with the 
higher order smoothness assumptions on p(x), the order and expression of the covariance remains 
the same. That is, 

Y, jk = Co v(Fj(x),F k (x)) 

= J K(v) 2 e(Jj(yv), MVv))dv + o > ( 5 - 24 ) 

where <j(Jj(y v ), J k {y v ) is the covariance between Jj(y v ) and J k (y v ). 


Applying the central limit theorem, one has 

\Jn\H\ (^F{x) — y{x) — Bias(a;)^ N(0, S(x)) 


(5.25) 


where the jth (j = 1,..., D) entry of Bias(x) is given in (5.21) or (5.22) depending on p is odd or 
even, and 


Hjfc — 


v{Jj(.yv)iJk{yv)) f K{v) 2 dv 
fx(x) 


(5.26) 
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Letting p(x) = n(x) + Bias(:r), one has 

Vn\H\ (f e (x) - V (£(*))) = V n \H\d^ x) V (f(x) - p{x^j + o P ( 1). 

Therefore by applying Slutsky’s theorem, one has 

sfiW\d Kx) V (f{x) - £(*)) 4 N{ 0, E(z)). (5.27) 

Here E(:r) = B t T,(x)B where B is the D x d matrix of the differential dp^P with respect to given 
orthonormal bases of the tangent space and tangent space Tjp x) M. 

a 
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